Highly Efficient Compensation-based Parallelism for Wavefront Loops on GPUs

Authors

Kaixi Hou

Hao Wang

Wu-chun Feng

Jeffrey S. Vetter

Seyong Lee

Abstract

Wavefront loops are widely used in scientific applications such as partial differential equation (PDE) solvers and sequence alignment tools. However, the data dependencies in wavefront loops make it challenging to fully utilize the abundant compute units of GPUs and to reuse data through their memory hierarchy. Existing solutions optimize for these factors only to a limited extent. For example, tiling-based methods optimize memory access but may result in load imbalance, while compensation-based methods, which change the original order of computation to expose more parallelism and then compensate for the change, suffer from both global synchronization overhead and limited generality. In this paper, we first prove the circumstances under which breaking data dependencies and properly changing the order of computation operators in our compensation-based method does not affect the correctness of the results. Based on this analysis, we design a highly efficient compensation-based parallelization method for GPUs. Our method provides weighted scan-based GPU kernels to optimize the computation and combines them with the tiling method to optimize memory access and synchronization. Performance results on the NVIDIA K80 and P100 GPU platforms demonstrate that our method achieves significant improvements over state-of-the-art research for four types of real-world application kernels.
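The compensation idea in the abstract can be illustrated on a small CPU-side example. The sketch below is not the authors' GPU kernels; it assumes a simple linear-gap alignment recurrence, and the function names `wavefront_serial` and `wavefront_compensated` are hypothetical. The first function honors all three dependencies cell by cell. The second drops the left-neighbor dependency so every cell in a row can be computed independently, then restores it with a weighted max-scan along the row.

```python
from itertools import accumulate


def _init_table(n, m, gap):
    """Boundary conditions: row 0 and column 0 hold accumulated gap penalties."""
    A = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        A[0][j] = j * gap
    for i in range(1, n + 1):
        A[i][0] = i * gap
    return A


def wavefront_serial(score, gap):
    """Reference version: each cell waits on its up, left, and diagonal neighbors."""
    n, m = len(score), len(score[0])
    A = _init_table(n, m, gap)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            A[i][j] = max(A[i - 1][j - 1] + score[i - 1][j - 1],
                          A[i - 1][j] + gap,
                          A[i][j - 1] + gap)
    return A


def wavefront_compensated(score, gap):
    """Illustrative compensation: break the left dependency, then repair it by a scan."""
    n, m = len(score), len(score[0])
    A = _init_table(n, m, gap)
    for i in range(1, n + 1):
        # Step 1: ignore A[i][j-1]; every j is now independent (parallel on a GPU).
        T = [A[i][0]] + [max(A[i - 1][j - 1] + score[i - 1][j - 1],
                             A[i - 1][j] + gap)
                         for j in range(1, m + 1)]
        # Step 2: compensate. A[i][j] = max over k <= j of T[k] + (j - k) * gap.
        # Subtracting j * gap turns this weighted scan into a plain prefix max.
        shifted = [T[j] - j * gap for j in range(m + 1)]
        prefix = list(accumulate(shifted, max))
        A[i] = [prefix[j] + j * gap for j in range(m + 1)]
    return A
```

The rewrite is valid here because max is associative and adding a constant distributes over max, so unrolling the left dependency gives the same value as the scan. On a GPU, the prefix max in step 2 would be replaced by a parallel scan primitive.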

Similar resources

Model-Driven Tile Size Selection for DOACROSS Loops on GPUs

DOALL loops are tiled to exploit DOALL parallelism and data locality on GPUs. In contrast, due to loop-carried dependences, DOACROSS loops must be skewed first in order to make tiling legal and exploit wavefront parallelism across the tiles and within a tile. Thus, tile size selection, which is performance-critical, becomes more complex for DOACROSS loops than DOALL loops on GPUs. This paper pr...

Mapping dynamic programming algorithms on graphics processing units

The Graphics Processing Unit (GPU) is a highly parallel, many-core streaming architecture that can execute hundreds of threads concurrently. The data parallel architecture of the GPU is suitable to perform computation intensive applications. In recent years, the use of GPUs for general purpose computation has increased and a large set of problems can be tackled by mapping onto GPUs. The program...

Wavefront Scheduling: Path Based Data Representation and Scheduling

The IA-64 architecture is rich with features that enable aggressive exploitation of instruction-level parallelism. Features such as speculation, predication, multiway branches and others provide compilers with new opportunities for the extraction of parallelism in programs. Code scheduling is a central component in any compiler for the IA-64 architecture. This paper describes the implementation...

Compilation of Modelica Array Computations into Single Assignment C for Efficient Execution on CUDA-enabled GPUs

Mathematical models, derived for example from discretisation of partial differential equations, often contain operations over large arrays. In this work we investigate the possibility of compiling array operations from models in the equation-based language Modelica into Single Assignment C (SAC). The SAC2C SAC compiler can generate highly efficient code that, for instance, can be executed on CU...

Efficient irregular wavefront propagation algorithms on hybrid CPU-GPU machines

We address the problem of efficient execution of a computation pattern, referred to here as the irregular wavefront propagation pattern (IWPP), on hybrid systems with multiple CPUs and GPUs. The IWPP is common in several image processing operations. In the IWPP, data elements in the wavefront propagate waves to their neighboring elements on a grid if a propagation condition is satisfied. Elemen...

Publication date: 2018
